Abstract. Natural history museum collections provide the basic documentation of life on Earth. As such, they represent the
critical and unique resource by which that life may be understood, and have immense economic and scientific importance.
Nevertheless, particularly in recent decades, natural history museums have received less and less attention and fewer resources, in
spite of their importance. A series of new efforts, however, aims to recoup that prominence via community efforts to unite
data resources towards a vastly improved understanding of biodiversity and its implications. The Species Analyst represents
an effort to unite natural history collections databases worldwide to this end: 77 institutions now cooperate or are committed
to cooperate in serving records of 51 million natural history museum specimens to users worldwide, and the facility has seen more
than 700,000 users to date.

1. INTRODUCTION


Computerization of ornithological collections is increasingly
considered a priority for curators and staff
of natural history museums. A common quandary,
however, is how and why to get started. The curator is
presented with a bewildering variety of databasing
programs, some especially designed for specimen
records, and others off-the-shelf generic database programs
that can be customized for any use. Choice of
a platform, choice of data fields, and choice of computerization
strategy all become critical, and difficult,
considerations. Unfortunately, these considerations
can often seem so complex that computerization
efforts are not initiated.


Moreover, presented with a thousand and one other 
priorities of collections building, specimen conserva- 
tion, institutional politics, and research efforts, and 
given the significant time investment that computeri- 
zation requires, the question arises as to whether the 
result is worth the time. That is, one must consider
what are the benefits of computerization, and how 
much do they benefit the collection, the curator, and 
the broader community. 


The purpose of this contribution is to provide a ration- 
ale for computerizing bird collections as a critical step 
forward in their care. Along the way, we review steps 
involved — a sort of minimum-standard guide to start- 
ing computerization efforts. Finally, we provide a 
series of examples of how computerizing collections 
data, and sharing those data across many institutions 
worldwide, benefits the collections themselves. 


2. WHY COMPUTERIZE A COLLECTION? 


Databasing or computerizing a collection is a lot of 
work, and may easily absorb years of effort. So why 
do it? Several reasons argue strongly for taking this 
step. A partial list follows: 


— Get to know your collection — a sweep through the
whole collection, drawer by drawer, gives a 
unique knowledge of a particular collection. 

— Discover important specimens — many fascinating 
discoveries have resulted from the specimen-by- 
specimen attention during computerization efforts, 
including species new to science, lost type speci- 
mens, important historical specimens, etc. 

— Detect problems — again, the specimen-by-specimen 
attention can help to detect serious problems that 
might otherwise not be noticed ... damage from 
insects or water, fading of plumages, drying of 
spirit specimens, etc. 

— New views of the collection — although we are famil- 
iar with summaries of collections in terms of tax- 
onomic completeness, and perhaps regional sum- 
maries, many new views of collections open when 
a collection is computerized, e.g., maps of the geo- 
graphic distribution of specimens, summaries of 
accessions over time, etc. 

— Save curatorial time — making summaries of hold- 
ings, preparing loan invoices, tracking down par- 
ticular specimens, and many other curatorial tasks 
are considerably more efficient when the collec- 
tion is available in database form. 

— Standardize taxonomy — once data are in electronic
form, comparing names against a standard list (e.g.,
the Peters’ check-list) can identify a first set of nonstandard
names that require checking and updating.

— Efficient information access — many questions and
data requests that require hours or days of work for
an uncomputerized collection will suddenly become
feasible to answer in minutes, making possible
much more creative uses of the information in
collections. For example,

   — What are your holdings of taxon X?
   — What are your holdings from country X?
   — Do you have specimens collected by person X?
   — What is the history of specimen acquisition rates
     in your collection?
   — And many more ...


In short, computerization of a collection is a major
undertaking, but ends up repaying the investment of 
time and effort many times over. 
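
The example questions listed above can be sketched as simple database queries. The following is only an illustration, assuming a single hypothetical table named specimen held in SQLite through Python's standard sqlite3 module; the table layout, the field names, and the three sample rows are invented for the example and do not come from any real collection.

```python
import sqlite3

# Hypothetical flat specimen table; all names and rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE specimen (
           catalogue_number INTEGER PRIMARY KEY,
           genus TEXT, species TEXT,
           collector TEXT, country TEXT,
           year_accessioned INTEGER
       )"""
)
conn.executemany(
    "INSERT INTO specimen VALUES (?, ?, ?, ?, ?, ?)",
    [
        (15230, "Cyanocitta", "cristata", "F. E. Jones", "USA", 1956),
        (15231, "Cyanocitta", "stelleri", "F. E. Jones", "Mexico", 1956),
        (20417, "Atlapetes", "pileatus", "A. Collector", "Mexico", 1972),
    ],
)


def holdings_of_taxon(genus):
    """What are your holdings of taxon X? Returns catalogue numbers."""
    rows = conn.execute(
        "SELECT catalogue_number FROM specimen "
        "WHERE genus = ? ORDER BY catalogue_number",
        (genus,),
    )
    return [row[0] for row in rows]


def holdings_from_country(country):
    """What are your holdings from country X? Returns a specimen count."""
    row = conn.execute(
        "SELECT COUNT(*) FROM specimen WHERE country = ?", (country,)
    ).fetchone()
    return row[0]


def accessions_by_year():
    """History of specimen acquisition: (year, number of specimens) pairs."""
    rows = conn.execute(
        "SELECT year_accessioned, COUNT(*) FROM specimen "
        "GROUP BY year_accessioned ORDER BY year_accessioned"
    )
    return rows.fetchall()
```

Each question becomes one short parameterized query, which is the sense in which requests that once took hours can be answered in minutes.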


3. CHOOSING A PLATFORM 


The first big question to be answered is about 
which platform (databasing program) to use. This 
decision becomes complex ... sometimes, museum 
administrators decide to force all collections in the
museum to use the same program. Even if one has 
the freedom to choose, should one choose among 
the many programs that have been developed 
specifically for natural history museum specimens 
(BIOTA, BIOTICA, SPECIFY, etc.), or a generic 
program off the shelf (e.g., Microsoft Access, Oracle)?
Regarding this choice, each option has its
strengths and weaknesses (Table 1). In general, we 
would recommend the off-the-shelf option for 
small, old, or inactive collections, and the speci- 
men databasing programs for larger, data-rich, and 
very active collections. 


Regardless of this choice, one should insist on sev- 
eral minimum criteria for a databasing platform. 
These criteria are critical features that a program
must have in order to avoid problems later:


— Capacity for export to other, generic formats, 
particularly ASCII delimited format, to allow 
reporting, export to other programs, and porting 
to future technologies and platforms. 

— Compatible with Structured Query Language
(SQL), which permits many functionalities related
to sharing data to be added to your database.
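
As a rough illustration of the first criterion, the sketch below exports every row of a table to delimited ASCII text. It assumes the data sit in SQLite and uses only Python's standard csv and sqlite3 modules; the function name export_table_as_delimited and the table name specimen are hypothetical, and other platforms will offer their own export routes.

```python
import csv
import io
import sqlite3


def export_table_as_delimited(conn, table, out, delimiter=","):
    """Write all rows of `table` to the file-like `out`, with a header row."""
    # Table names cannot be bound as SQL parameters, so only plain
    # identifiers are accepted here.
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")
    cursor = conn.execute(f"SELECT * FROM {table}")
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(cursor)


# Hypothetical one-row table used only to demonstrate the export.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE specimen (catalogue_number INTEGER, genus TEXT, species TEXT)"
)
conn.execute("INSERT INTO specimen VALUES (15230, 'Cyanocitta', 'cristata')")

buffer = io.StringIO()
export_table_as_delimited(conn, "specimen", buffer, delimiter="\t")
exported = buffer.getvalue()
```

A tab-delimited file of this kind can be opened by essentially any spreadsheet or database program, which is what makes it a safe route to future platforms.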


Once a platform has been identified that fits the 
particular needs of a collection, and meets these 
basic requirements, then design of the computeri- 
zation effort can begin. 


If the reasoning outlined above suggests that the 
best solution to computerization is that of a more 
complex program specifically designed for natural 
history specimen data, then you should read about 
several of the programs that are available. Links to 
a number of such programs are presented in Table 2. 


Table 1: Summary of advantages and disadvantages of specialized versus generic programs as platforms for computerizing
bird collections.

NATURAL HISTORY MUSEUM SPECIMEN DATABASING PROGRAMS

Advantages
— Designed specifically for specimen management
— Features such as authority lists, loan invoice reporting, etc.
— No customization or little customization required
— Most complex solutions specific to natural history specimens are tractable

Disadvantages
— Can disappear: long-term support often depends on a person (researcher or developer) who can decide not to
  support the program further, or who may decide not to update to newer versions (e.g., MUSE)
— Expert advice may be unavailable in a particular city
— May not permit very simple solutions to simple problems
— Steeper learning curve

OFF-THE-SHELF GENERIC DATABASING PROGRAMS

Advantages
— Long-term continuity of support from the company
— Easy availability of expert advice, given broad usage in many communities
— Simplest solutions are feasible
— Simple learning curve

Disadvantages
— May need customization of program for intermediate-to-complex situations
— Not designed specifically for specimen data
— Complex features (e.g., reporting, authority lists) not automatically available

Table 2: Selected specialized programs designed specifically
for collections data. Provided are World Wide Web
links for more information.

Program      Link
SPECIFY      http://usobi.org/specify/
Biótica      http://www.conabio.gob.mx/biotica_ingles/distribucion_b.html
BioLink      http://www.biolink.csiro.au/
BIOTA        http://viceroy.eeb.uconn.edu/biota
KE EMu       http://www.kesoftware.com/
             (not recommended for integration via Species Analyst)


4. CHOOSING DATA FIELDS 


This step may prove to be the most critical of all in the 
process of computerization. With too many fields, 
time and filespace are wasted, whereas with too few, 
they will have to be added later or one will have to 
live without them. If an incorrect structure is chosen,
the database may be forever handicapped by this 
design flaw. However, the challenge is reduced quite 
a bit with an understanding of a few basic ideas. Spec- 
imen data, in their simplest form, distill down to three 
linked sets of information about each specimen: 


— Taxonomic information — the taxonomic identity of 
the specimen 

— Geographic information — the geographic location 
of its collection 

— Detailed documentation of the specimen — time of 
collection, collector identity, museum catalogue 
number, sex, age, body mass, etc. 


Thinking in this manner, we can envision a structure 
for a specimen database that would capture this infor- 
mation optimally. Taxonomy and geography are both 
hierarchical concepts, and so we can represent them 
as such, which would make for three interacting sets 
of information (Fig. 1). 
In the simplest sense, then, one could use a straightforward
single table that holds the critical fields (see Table 3), either
in a spreadsheet program such as Microsoft Excel or (better
still) in a database program such as Microsoft Access.
This very simple structure provides a clear, workable solution for
small collections. In a more complex situation, in 
which more specimens are to be computerized, this 
structure can be made relational (Fig. 1) (that is, made 
up of several tables that interconnect). The advantage 
of a relational database structure is that elements of 
the database are entered only once: e.g., the locality 
descriptor for the 150 specimens collected at USA/ 
Kansas/Douglas Co./Lawrence/10 km E is entered 
only once, reducing the possibility of typographical 
errors. 
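
To make the idea of entering each element only once concrete, the following sketch sets up three linked tables in SQLite through Python's standard sqlite3 module. It is a minimal illustration under assumed names (the tables taxon, locality, and specimen, and all of their fields, are hypothetical), and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE taxon (
        taxon_id INTEGER PRIMARY KEY,
        genus TEXT NOT NULL, species TEXT NOT NULL, subspecies TEXT
    );
    CREATE TABLE locality (
        locality_id INTEGER PRIMARY KEY,
        country TEXT, state_province TEXT, county TEXT, named_place TEXT
    );
    CREATE TABLE specimen (
        catalogue_number INTEGER PRIMARY KEY,
        taxon_id INTEGER NOT NULL REFERENCES taxon(taxon_id),
        locality_id INTEGER NOT NULL REFERENCES locality(locality_id),
        collector TEXT, date_collected TEXT, sex TEXT, age TEXT
    );
    """
)

# The taxon and the locality descriptor are each typed in exactly once ...
conn.execute("INSERT INTO taxon VALUES (1, 'Cyanocitta', 'cristata', 'cristata')")
conn.execute(
    "INSERT INTO locality VALUES "
    "(1, 'USA', 'Kansas', 'Douglas Co.', 'Lawrence, 10 km E')"
)
# ... and every specimen from that site simply points at them by id.
conn.executemany(
    "INSERT INTO specimen VALUES "
    "(?, 1, 1, 'F. E. Jones', '1956-10-24', 'Female', 'Adult')",
    [(number,) for number in range(15230, 15233)],
)


def specimens_with_locality():
    """Join the three tables back into one readable row per specimen."""
    rows = conn.execute(
        """SELECT s.catalogue_number, t.genus || ' ' || t.species, l.named_place
           FROM specimen AS s
           JOIN taxon AS t ON t.taxon_id = s.taxon_id
           JOIN locality AS l ON l.locality_id = s.locality_id
           ORDER BY s.catalogue_number"""
    )
    return rows.fetchall()
```

Correcting a misspelled locality then means editing one row in the locality table rather than every specimen record from that site.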


Table 3: Critical minimum set of fields for a simple collections
database.

Field                                  Example
Catalogue number                       15230
Genus                                  Cyanocitta
Species                                cristata
Subspecies                             cristata
Date of collection                     24 October 1956
Collector                              Fredrick E. Jones
Sex                                    Female
Age                                    Adult
Body mass                              120 g
Country                                USA
State or province                      Ohio
County                                 Butler Co.
Named place and directions from        Oxford, 10 km E


This sort of simple relational structure can be implemented
in a program such as Microsoft Access with a
few hours’ attention by a technician familiar with the
program. The custom specimen database programs
use a more complex relational structure,
but one that is in essence based
on this overall backbone. Again, the
more complex the demands that one
will wish to place on the database
(e.g., more complex queries, more
detailed reporting, more specimens),
the more complex the database structure
that will be required. For relatively
simple applications, however,
the simple flat file (single table)
setup described above will often be
adequate.

[Figure 1: three linked groups of fields.
Taxonomy: Species, Genus, Family, Order.
Specimen: Unique catalogue number, Date of collection, Collector, Sex, Age.
Geography: Named place and directions from, County, State or province, Country, Region.]

Fig. 1: Diagrammatic illustration of a simple relational database structure
designed to link hierarchically organized geographic and taxonomic information
with specific data regarding a particular specimen.
5. COMPUTERIZATION STRATEGY


The next question to be faced is the strategy for computerization.
This decision depends heavily on the
exact situation of a collection. If, on the one hand, an
excellent paper catalogue or card file exists, one may
wish to computerize directly from that, and then verify
the accuracy and completeness later from the
actual specimens. If, on the other hand, a good card
file or catalogue does not exist, or if many specimens
may have been omitted (exchanged or deaccessioned)
or not entered in the catalogue, then you may be better
off computerizing directly from specimens.


In general, two passes through the collection will be 
necessary as part of any computerization effort. The 
first will simply get each specimen’s data into the 
computer as efficiently as possible. The second will 
verify (1) the existence of the specimen, (2) that all 
data elements are entered in the database, and (3) that
all of the specimen’s data are correct as entered. This 
verification step, although labor-intensive, is critical
to making the database a correct representation of the 
information contained in the specimens’ labels. 
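
Part of the second pass can be prepared by machine before anyone opens a drawer. The sketch below covers only check (2), listing records in which a required field is empty; the field names in REQUIRED_FIELDS and the function incomplete_records are assumptions made for this illustration. Checks (1) and (3) still require comparison against the specimens and their labels.

```python
# Hypothetical list of fields that every record is expected to carry.
REQUIRED_FIELDS = (
    "catalogue_number", "genus", "species",
    "date_collected", "collector", "country",
)


def incomplete_records(records, required=REQUIRED_FIELDS):
    """Return {catalogue_number: [names of empty required fields]}.

    `records` is an iterable of dicts, one per specimen. A field counts
    as empty when it is absent, None, or only whitespace.
    """
    problems = {}
    for record in records:
        missing = [
            field for field in required
            if not str(record.get(field) or "").strip()
        ]
        if missing:
            problems[record.get("catalogue_number")] = missing
    return problems


# Invented sample records for demonstration.
sample = [
    {"catalogue_number": 15230, "genus": "Cyanocitta", "species": "cristata",
     "date_collected": "1956-10-24", "collector": "F. E. Jones", "country": "USA"},
    {"catalogue_number": 15231, "genus": "Cyanocitta", "species": "",
     "date_collected": "1956-10-24", "collector": None, "country": "USA"},
]
to_check = incomplete_records(sample)
```

The resulting list tells the verifier which drawers and which label fields deserve attention first.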


All computerization efforts should involve the critical 
step of backing up data at regular intervals. Too many 
‘impossible accidents’ have removed a year of work, 
and set a computerization effort back terribly. Back- 
ing up data should be done as permanently as possible 
... that is, compact disks are better than floppy disks. 
It should also be done with redundancy: each time 
that you make a copy, if at all possible, it should not 
over-write the previous copy. This preservation of 
‘versions’ of the database allows one to go back a 
week or a month if some error appears in the data set. 
Finally, given the possibility of more catastrophic 
losses, the back-up copies should be stored off-site, 
preferably in several places. Excellent storage sites
for these copies can include libraries or archives, or 
curators’ homes, or they can even be transferred via 
the Internet or via mail to another country. 
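
The advice not to over-write the previous copy can be followed with a very small script. The sketch below, which uses only the Python standard library, copies the database file under a timestamped name so that earlier versions are preserved; the function name backup_database, the file name birds.db, and the naming pattern are illustrative assumptions. Copying the resulting files off-site remains a separate step.

```python
import shutil
import tempfile
from datetime import datetime
from pathlib import Path


def backup_database(db_file, backup_dir, now=None):
    """Copy db_file into backup_dir under a timestamped name.

    An existing backup is never overwritten; a name clash raises an error.
    """
    db_file = Path(db_file)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    target = backup_dir / f"{db_file.stem}-{stamp}{db_file.suffix}"
    if target.exists():
        raise FileExistsError(f"refusing to overwrite {target}")
    shutil.copy2(db_file, target)
    return target


# Demonstration in a throwaway directory with a stand-in database file.
workdir = Path(tempfile.mkdtemp())
source = workdir / "birds.db"
source.write_text("catalogue data, version 1")
first = backup_database(source, workdir / "backups", now=datetime(2002, 1, 7, 9, 0, 0))
source.write_text("catalogue data, version 2")
second = backup_database(source, workdir / "backups", now=datetime(2002, 1, 14, 9, 0, 0))
```

Because each copy carries its own date, one can return to the state of the database a week or a month earlier if an error is discovered.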


6. THE SPECIES ANALYST (TSA) 


The Species Analyst (http://speciesanalyst.net/) is a col- 
lection of software tools that permits integration of 
computerized collections data among institutions 
around the world into a distributed biodiversity infor- 
mation facility. For example, a user might wish to ask 
for records of any taxon from Yellowstone National 
Park or from Burma, or all specimens collected by 
Alexander von Humboldt, and retrieve information in a 
matter of seconds from 50 institutions around the world. 


TSA uses a hybrid of Z39.50 (an information transfer
protocol developed about 20 years ago in the bibliographic
community) and XML (a more modern and
efficient data format) to permit efficient query and
retrieval of data. TSA may be accessed via a web portal
that permits basic queries, or via extensions to
Microsoft Excel (for retrieval of data in spreadsheet
format) and ArcView (for retrieval of data as GIS
coverages) (downloads available at
http://speciesanalyst.net/downloads).


TSA currently integrates data sets from 22 institutions,
for a total of 15 million specimen data records
for over 50,000 species; a total of 58 institutions has 
committed to participation formally, which will take 
the total number of specimen records served to about 
50 million. A special strength at present is in ichthy- 
ological data, as FishNet (http://speciesanalyst.net/ 
fishnet/) has taken excellent advantage of TSA tech- 
nology to create a data facility linking most important 
computerized fish collections. Now funded is a paral- 
lel network for mammal collections data (MANIS, 
based at the Museum of Vertebrate Zoology; http:// 
elib.cs.berkeley.edu/manis/), and networks for her- 
petological and ornithological (expanded) specimen 
data are pending and in preparation, respectively. 


7. WHY SHARE DATA ONCE 
COMPUTERIZED? 


Above, we listed the first set of benefits of computer- 
ization of bird collections — namely, freer and more 
complete access to the information content of the 
specimens that make up the collection. These benefits 
are indeed considerable, and add enormously to a 
curator’s ability to take care of a collection. However, 
once data are computerized, if they are shared, and 
integrated with data from other collections around the 
world, an additional set of benefits accrues. 


In essence, a set of emergent properties comes into 
being once all (or nearly all) data are integrated for a 
particular taxon or region. We have come to appreci- 
ate these emergent properties as we have assembled 
the Atlas of Mexican Bird Distributions (NAVARRO & 
PETERSON, in prep.), a centralized database now 
including the contents of more than 60 natural history 
museum collections of Mexican birds. This 11-year
project has resulted in a diversity of synthetic publications
regarding the Mexican avifauna (NAVARRO-SIGUENZA
et al. 1992a, b; PETERSON 1993; PETERSON
et al. 1993; PETERSON 1998; PETERSON et al. 1998a, b;
NAVARRO-SIGUENZA & PETERSON 1999, 2000; PETERSON
et al. 2000, 2001, 2002). Herein, we will use this
exemplar data set to demonstrate a variety of potential
benefits of broad integration of data across institutions,
as follows:


7.1 Georeferencing as a Community 


Georeferencing locality data for specimens opens
doors to a multitude of new capabilities and new functionalities
for collections data. Indeed, all of the
advances of geographic information systems (GIS)
open up to collections data once latitude and longitude
data are available for the collecting localities for
each specimen. Nevertheless, georeferencing collections
data, even once they are in electronic form,
represents an enormous task.

Fig. 2: Map of Mexico with collecting localities plotted by numbers of specimens collected at each point (graded symbol
size: smallest = 1 specimen, largest = >100 specimens). For five points, to illustrate the redundancy of collecting localities
among museums, we provide pie diagrams that illustrate the relative holdings of specimens from that particular site among
scientific collections (see Acknowledgements for institutions and abbreviations).


Integrating this task over many institutions, however, 
takes advantage not just of having more people to 
help in a large task, but also of the redundant nature 
of the geographic sampling of birds (Fig. 2). Indeed, 
more than 25 % of Mexican bird collecting localities 
occur in more than one museum, and some in more 
than 20 museums. This redundancy results from collections
being dispersed among numerous museums
(e.g., the specimens of Wilmot W. BROWN from 
Chilpancingo, Guerrero), and from certain sites being 
especially accessible or well-known as collecting 
localities in particular regions (e.g., Cerro San Felipe, 
Oaxaca). 


A first experiment in cooperative georeferencing is 
beginning in the mammal community in the United 
States. The MANIS network, a U.S. National Science 
Foundation-funded effort, is connecting 17 institu- 
tions with computerized holdings of mammal speci- 
mens. A first step in MANIS integration efforts is the 
pooling of institutional lists of localities to be georeferenced;
institutions are then ‘signing up’ for particular
regions, perhaps a home state, or an area of particular
interest to the curator. In this way, efforts in georeferencing
have a direct return for a particular investigator
or institution, and add to the community pool
of georeferenced information.
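
The saving that comes from pooling locality lists can be illustrated with a short sketch. It assumes that each institution supplies its localities as plain text strings and that trivial differences of case and spacing are the only variation to reconcile; real locality descriptors will need far more careful matching. The museum names and the function names are hypothetical.

```python
from collections import defaultdict


def normalize(locality):
    """Collapse case and whitespace so trivially different spellings match."""
    return " ".join(locality.lower().split())


def pool_localities(institution_lists):
    """Map each distinct locality to the set of institutions that hold it."""
    pooled = defaultdict(set)
    for institution, localities in institution_lists.items():
        for locality in localities:
            pooled[normalize(locality)].add(institution)
    return dict(pooled)


# Invented lists from three hypothetical museums.
lists = {
    "Museum A": ["Mexico/Guerrero/Chilpancingo", "Mexico/Oaxaca/Cerro San Felipe"],
    "Museum B": ["mexico/guerrero/chilpancingo"],
    "Museum C": ["Mexico/Oaxaca/Cerro  San Felipe", "Mexico/Guerrero/Chilpancingo"],
}
pooled = pool_localities(lists)
total_entries = sum(len(entries) for entries in lists.values())
shared = {loc for loc, holders in pooled.items() if len(holders) > 1}
```

In this toy case five list entries reduce to two localities to be georeferenced, each of which then serves every institution that holds material from it.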


7.2 Detecting Errors in Date and Locality 


Once specimen data are integrated, and have been geo- 
referenced, further data refinements are possible. A 
common question is that of the relative reliability of 
the data associated with specimens from different collectors
(BINFORD 1989). Because of the fragmented
and dispersed nature of collectors’ material, an answer has
always been out of reach. For instance, the still-living
collector and ornithologist Robert W. DICKERMAN
has deposited specimens at 14 of the 32 museums
included in our present summary; the early twentieth 
century collector Wilmot W. BROWN has specimens 
distributed across 23 of the 32 museums. Once these 
data are pooled, however, new insights become possi- 
ble regarding collectors’ relative reliability. 
Fig. 3: Maps of collecting localities for two contrasting
groups of collectors in Mexico: a Museo de Zoologia,
UNAM (MZFC) expedition in Spring 1991, and the collections
of Mario del Toro Aviles in June 1949. Organized by
collection date, consistencies and inconsistencies of specimen
labeling become clear.

Basically, by assembling the entire opus of a collector,
and sorting specimen locality by collecting date, it is 
possible to assess how geographically reasonable the 
combination of dates and localities is. Hence, to pres- 
ent a contrasting pair of examples, a Museo de 
Zoologia, UNAM, expedition in 1991 scouted numer- 
ous sites in central and eastern Oaxaca (Fig. 3, top): 
although its route was complex, specimens from par- 
ticular localities were clumped in time, and a sensible 
route could be reconstructed (although, in constructing
this example, we detected an error in our georeferenc- 
ing ... the ‘Benito Juarez’ referred to in the locality 
descriptor was the one in eastern Oaxaca, not the one 
in central Oaxaca). In stark contrast, specimens scat- 
tered across four museums (MLZ, LACM, FMNH, 
USNM) suggest that the infamous collector Mario del 
TORO AVILES worked at several sites across Mexico in 
June 1949; plotting these localities by date, however, 
reveals a number of points at which impossibly long 
journeys would have had to have been made in too 
short a time (Fig. 3, bottom). This result confirms ear- 
lier suspicions that del TORO AVILES’ dates and locali- 
ties are to be regarded with utmost caution (BINFORD 
1989; PETERSON & NIETO-MONTES DE OCA 1996). 


This approach can be used to detect problems in collectors’
series, which will be errors either in date of
collection or in collecting locality. Indeed, for an integrated,
distributed data set consisting of the holdings
of many institutions, it could be implemented as an 
error-seeking module that scans the data set collector 
by collector, and flags particular records as potential 
problems. These flagged specimen lists could then be 
distributed to collection curators for checking. 
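
One possible form for such an error-seeking module is sketched below. For a single collector, it sorts records by date and flags consecutive pairs whose great-circle distance implies an implausible rate of travel. The threshold of 150 km per day, the function names, and the sample coordinates are all assumptions made for illustration; they are not taken from the data set discussed here, and a sensible threshold would depend on the era and terrain.

```python
import math
from datetime import date


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def flag_implausible_moves(records, max_km_per_day=150.0):
    """Flag consecutive records of ONE collector that imply too-fast travel.

    `records` holds (catalogue_number, date, latitude, longitude) tuples.
    Returns a list of (earlier_catalogue_number, later_catalogue_number).
    """
    ordered = sorted(records, key=lambda record: record[1])
    flagged = []
    for previous, current in zip(ordered, ordered[1:]):
        # Records from the same day are treated as one day apart.
        days = max((current[1] - previous[1]).days, 1)
        km = haversine_km(previous[2], previous[3], current[2], current[3])
        if km / days > max_km_per_day:
            flagged.append((previous[0], current[0]))
    return flagged


# Invented itinerary: two nearby sites, then a jump of well over 1000 km
# in a single day, then a long stay near the new site.
itinerary = [
    (1, date(1949, 6, 1), 17.06, -96.72),
    (2, date(1949, 6, 2), 17.10, -96.70),
    (3, date(1949, 6, 3), 28.63, -106.08),
    (4, date(1949, 6, 20), 28.70, -106.10),
]
suspect_pairs = flag_implausible_moves(itinerary)
```

A flag marks a pair of records, not a culprit: either the date or the locality of either specimen may be at fault, which is why the flagged lists go back to curators for checking.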


7.3 Detecting Errors in Identification or 
Georeferencing 


A further refinement to specimen data also becomes 
possible, which will detect problems either in species
identification or in georeferencing of localities. In 
essence, by viewing large quantities of occurrence 
data for a particular species, it is possible to detect 
spatial outliers, which likely represent identification 
or georeferencing problems. This process can be 
refined still further via ecological niche modeling for 
species: the ecological needs of a species are modeled 
(PETERSON 2001; PETERSON et al., in press) using 
high-end computational tools (STOCKWELL & NOBLE 
1992; STOCKWELL 1999; STOCKWELL & PETERS 1999). 
These procedures use known occurrences of a species 
to produce a geographic view of areas meeting and 
not meeting its ecological needs; overlaying the same 
known occurrence points used to build the models 
allows identification of outlier occurrences.


As an example of this approach, we used the known
occurrences of the brush-finch Atlapetes pileatus to
build an ecological model and identify areas of appropriate
and inappropriate ecological conditions for the
species (Fig. 4). The modeling algorithm used is
detailed elsewhere (STOCKWELL & NOBLE 1992;
STOCKWELL 1999; STOCKWELL & PETERS 1999;
PETERSON 2001; PETERSON et al., in press), but the
result is that all known occurrence points fall into
areas predicted to be appropriate for the species
except one. This point (Fig. 4) represents an old locality
on the coast of Tamaulipas, in the lowlands of
eastern Mexico. The ecological modeling procedure
identifies this site as a specimen locality that is not
within the ecological possibilities of the species, and
that most likely represents an erroneous locality designation.

Computerizing Bird Collections and Sharing Collections Data Openly

Fig. 4: Map of known collecting localities for the brush-finch
Atlapetes pileatus, overlain on a map of regions fitting
the modeled ecological needs of the species (in gray),
showing an old coastal locality in Tamaulipas (labeled
‘Questionable locality’) as falling
outside of the species’ ecological niche.


Like the collector itinerary approach, a procedure 
based on ecological niche modeling could be imple- 
mented as an error detection facility. A computer 
could periodically scan the pooled data resources for
known occurrence points of each species, build ecological
niche models for each species, and detect
occurrence points that fall outside the ecological limits
of the species. These points can then be flagged for
checking by curators or collections staff. 
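
The flagging step can be illustrated with a deliberately crude stand-in for a niche model. The sketch below is not the modeling algorithm cited above; it considers a single environmental variable (elevation is assumed here) and flags a record when its value lies far from the values at all other known occurrences of the species. The function name, the threshold k, and the sample values are invented for the illustration.

```python
from statistics import mean, stdev


def flag_environmental_outliers(occurrences, k=3.0):
    """Flag records whose environment is unlike that of the other records.

    `occurrences` is a list of (record_id, value) pairs, where value is one
    environmental variable at the collecting locality. A record is flagged
    when its value lies more than k standard deviations from the mean of
    all OTHER records (a leave-one-out comparison).
    """
    flagged = []
    for index, (record_id, value) in enumerate(occurrences):
        others = [v for j, (_, v) in enumerate(occurrences) if j != index]
        if len(others) < 2:
            continue  # too few points to estimate a spread
        spread = stdev(others)
        # With zero spread among the others no tolerance can be estimated,
        # so such cases are skipped rather than flagged.
        if spread > 0 and abs(value - mean(others)) > k * spread:
            flagged.append(record_id)
    return flagged


# Invented elevations (m): five montane records and one at the coast.
elevations = [
    ("a", 1500), ("b", 1800), ("c", 2100),
    ("d", 1650), ("e", 1950), ("f", 10),
]
suspects = flag_environmental_outliers(elevations)
```

A real implementation would use many environmental dimensions and a proper modeling algorithm, but the output is of the same kind: a short list of records for curators to re-examine.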


7.4 Community-wide Activities: 
The Power of Numbers 


Much more generally than for the preceding examples,
it is important to emphasize the power of working
as a community. When a proposal stems from a
Division of Mammalogy at a particular museum, it 
carries far less force than a proposal that comes from 
all of the Mammalogy divisions from 17 institutions. 
This power of numbers — working as a community — 
makes possible many bold new funding initiatives. 


Indeed, in the Species Analyst effort, several such 
community proposals have already been prepared, 
and have proven enormously successful. Proposals 
have been prepared and funded for a pilot North 
American bird network (U.S. National Science Foun- 
dation, funded 1998), a 15-member fish data network 
(U.S. National Science Foundation and U.S. Office of 
Naval Research, funded 2000), and a 17-member 
mammal data network (U.S. National Science Foun- 
dation, funded 2001). This success clearly results 
from the community nature of the proposals, and has 
resulted in more than $2 million of new funding being 
available to the systematics collections community. 


More generally, community efforts constitute an 
important step towards demonstrating the power of 
the systematics collections community in many real- 
world challenges. Work as a community shows the 
true analytical power of the data that the systematics 
collections community holds. This power is a key in
convincing funding agencies, museum administrators, 
and decision-makers in general of the importance of 
systematic collections. 


8. CONCLUSIONS 


The point of this piece is that computerization is not a 
prohibitively difficult or expensive endeavor; rather, 
it is an important step in curating a collection that 
more than pays for itself in (1) saving time and effort 
in curatorial activities, (2) improving data quality and 
removing erroneous elements, and (3) improving funding
possibilities and recognition by administrators and
decision-makers. Most important is to make some 
simple decisions, start into the task, and methodically 
carry it out. 